Note: I have only done the EDA needed to answer the questions asked. I have not done any EDA for the purpose of feature engineering or feature selection.
## column_name nas_count nas_percent
## 1 checking_account_status 0 0
## 2 duration_in_months 0 0
## 3 credit_history 0 0
## 4 purpose 0 0
## 5 credit_amount 0 0
## 6 savings_account_status 0 0
## 7 present_employment_since 0 0
## 8 installment_as_percent_of_income 0 0
## 9 marital_sex_type 0 0
## 10 role_in_other_credits 0 0
## 11 present_resident_since 0 0
## 12 assset_type 0 0
## 13 age 0 0
## 14 other_installment_plans 0 0
## 15 housing_type 0 0
## 16 count_existing_credits 0 0
## 17 employment_type 0 0
## 18 count_dependents 0 0
## 19 has_telephone 0 0
## 20 is_foreign_worker 0 0
## 21 is_credit_worthy 0 0
So, no missing data. Yayyy!
Before exploring the relationship of the predictors with the target, let’s first clearly define the target.
Credit worthiness for a group of observations can be measured by the Good/Total proportion, i.e. the share of observations in the group labelled “Good”. The higher the proportion, the higher the credit worthiness.
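As a rough illustration of this definition, here is a minimal Python sketch that computes the Good/Total proportion per group. The records are made-up toy data and `good_proportion` is a hypothetical helper name; neither comes from the actual analysis.

```python
from collections import defaultdict

# Hypothetical toy records: (group value, credit label). Not the real data.
records = [
    ("A30", "Bad"), ("A30", "Good"), ("A30", "Bad"),
    ("A32", "Good"), ("A32", "Good"), ("A32", "Bad"),
    ("A34", "Good"), ("A34", "Good"), ("A34", "Good"), ("A34", "Bad"),
]

def good_proportion(rows):
    """Return {group: Good / Total} for an iterable of (group, label) pairs."""
    good = defaultdict(int)
    total = defaultdict(int)
    for group, label in rows:
        total[group] += 1
        if label == "Good":
            good[group] += 1
    return {group: good[group] / total[group] for group in total}

proportions = good_proportion(records)
for group in sorted(proportions):
    print(group, round(proportions[group], 3))
```

A group with a higher value here would be read as more credit worthy under the definition above.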
Question: Would a person with a critical credit history be more credit worthy?
Again, let’s first define what critical means. In the absence of any concrete definition, I will assume ‘critical’ roughly means more existing credits, i.e. it increases from A30 to A34.
A more critical credit history has a positive association with credit worthiness.
Q. Are young people more creditworthy?
The distributions overlap considerably. But there are more young people in “Bad” than in “Good”, and that is also visible in the difference in means.

> So, young people seem slightly less credit worthy.

But let’s break age into groups to see finer details.
“Bad” is quite low for the (34, 39] age group
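For readers unfamiliar with the `(34, 39]` notation: it is a right-closed interval, so age 39 belongs to the group and age 34 does not. A minimal Python sketch of such binning follows. The bin edges are hypothetical apart from the 34 and 39 boundaries mentioned above, and `age_bin` is an illustrative helper, not the code used in the analysis.

```python
from bisect import bisect_left

# Hypothetical bin edges; only the 34 and 39 boundaries appear in the text.
EDGES = [17, 24, 29, 34, 39, 49, 75]

def age_bin(age):
    """Return the label of the right-closed interval (lo, hi] containing age."""
    i = bisect_left(EDGES, age)
    if i == 0 or i == len(EDGES):
        raise ValueError(f"age {age} is outside the bin range")
    return f"({EDGES[i - 1]}, {EDGES[i]}]"

# Toy (age, label) pairs to show a per-bin share of "Bad".
toy = [(22, "Bad"), (23, "Good"), (36, "Good"), (39, "Good"), (34, "Bad")]
counts = {}
for age, label in toy:
    bad, total = counts.get(age_bin(age), (0, 0))
    counts[age_bin(age)] = (bad + (label == "Bad"), total + 1)

for label_bin, (bad, total) in sorted(counts.items()):
    print(label_bin, bad / total)
```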
Q. Would a person with more credit accounts be more credit worthy?
I am assuming that more credit accounts is the same as “Number of existing credits at this bank”, i.e. ‘count_existing_credits’.
The data is too unreliable to say anything about the relationship between the number of credit accounts and credit worthiness.
Consequently, there is no feature engineering.
For feature selection I have used Boruta, which I have almost always found to be the best feature selection technique. Below is what the Boruta plot looks like:
Selected features are:
## [1] "checking_account_status" "duration_in_months"
## [3] "credit_history" "purpose"
## [5] "credit_amount" "savings_account_status"
## [7] "present_employment_since" "installment_as_percent_of_income"
## [9] "role_in_other_credits" "assset_type"
## [11] "age" "other_installment_plans"
## [13] "housing_type" "employment_type"
## [15] "is_credit_worthy"
It is worse to class a customer as ‘Good’ when they are ‘Bad’, than it is to class a customer as bad when they are good.
Let ‘Good’ be the positive class, and ‘Bad’ be the negative class. So the above statement will translate to:
> False Positives (FPs) are more expensive than False Negatives (FNs)
Such cases fall under the **Cost Sensitive Learning** strategy, and the following sub-strategies can be followed under it:
I will try the following three models:
- Logistic Regression
- Boosted Trees: GBM
- Random Forest
I will go with a Custom evaluation metric:
I have assigned the following weights to the different buckets of the confusion matrix to penalize each bucket differently:
## Reference
## Prediction Good Bad
## Good -0.4 1
## Bad 0.2 0
There is no particular reason for these values; only their relative differences matter, because they penalize FPs more than FNs. Plus, I am rewarding TPs (True Positives).
Now, the custom metric is just the normalized sum-product of these weights and the confusion matrix of the model. Let’s call it “credit_cost”.
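To make the metric concrete, here is a minimal Python sketch of the sum-product. It assumes "normalized" means dividing by the total number of observations; under that assumption, an all-“Good” prediction on 561 Good and 241 Bad training rows (the class counts that appear in the confusion matrices of this report) gives the baseline train cost of about 0.0207. The function name and dictionary layout are illustrative choices, not the original implementation.

```python
# Weights from the table above, keyed by (prediction, reference).
WEIGHTS = {
    ("Good", "Good"): -0.4,  # true positive: rewarded
    ("Good", "Bad"): 1.0,    # false positive: most expensive
    ("Bad", "Good"): 0.2,    # false negative: mildly penalized
    ("Bad", "Bad"): 0.0,     # true negative: neutral
}

def credit_cost(confusion):
    """Sum-product of weights and confusion-matrix counts, divided by the
    total count (assumed normalization)."""
    total = sum(confusion.values())
    weighted = sum(WEIGHTS[cell] * count for cell, count in confusion.items())
    return weighted / total

# Baseline: everybody predicted "Good" on the training set.
baseline_train = {
    ("Good", "Good"): 561,
    ("Good", "Bad"): 241,
    ("Bad", "Good"): 0,
    ("Bad", "Bad"): 0,
}
baseline_cost = credit_cost(baseline_train)
print(baseline_cost)
```

Lower (more negative) values are better, since true positives carry a negative weight and errors carry positive ones.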
I have used an 80:20 train/test split. For validation, I will use cross-validation wherever required.
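One way such a split could be done is sketched below in Python. It is an assumption that the split was stratified by class; the 700 Good / 300 Bad mix is inferred from the train and test counts in this report, and `stratified_split` and the seed are hypothetical. The exact row counts of the real split may differ by a row or two.

```python
import random

def stratified_split(labels, train_frac=0.8, seed=42):
    """Split row indices into train/test, keeping the class mix in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_frac * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Assumed class mix: 700 Good and 300 Bad rows.
labels = ["Good"] * 700 + ["Bad"] * 300
train_idx, test_idx = stratified_split(labels)
train_good_share = sum(labels[i] == "Good" for i in train_idx) / len(train_idx)
print(len(train_idx), len(test_idx), train_good_share)
```

Stratifying keeps the Good share at roughly 0.7 in both parts, which matters here because the baseline and the precision figures depend directly on class prevalence.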
I am taking the baseline as predicting everybody as ‘Good’.
Train credit_cost
## Baseline Train Cost: 0.0206982543640898
## Baseline Train Precision: 0.699501246882793
Test credit_cost
## Baseline Test Cost: 0.0171717171717172
## Baseline Test Precision: 0.702020202020202
Train Results (Logistic Regression):
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Bad
## Good 518 116
## Bad 43 125
##
## Accuracy : 0.802
## 95% CI : (0.772, 0.829)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.0000000000343
##
## Kappa : 0.484
##
## Mcnemar's Test P-Value : 0.0000000112995
##
## Sensitivity : 0.923
## Specificity : 0.519
## Pos Pred Value : 0.817
## Neg Pred Value : 0.744
## Prevalence : 0.700
## Detection Rate : 0.646
## Detection Prevalence : 0.791
## Balanced Accuracy : 0.721
##
## 'Positive' Class : Good
##
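As a reading aid, the statistics in the output above follow directly from the four confusion-matrix counts, with ‘Good’ as the positive class. The Python sketch below recomputes them; variable names are illustrative, and "Pos Pred Value" is the precision referred to elsewhere in this report.

```python
# Counts from the confusion matrix above (rows: prediction, columns: reference).
tp, fp = 518, 116   # predicted Good: actually Good / actually Bad
fn, tn = 43, 125    # predicted Bad:  actually Good / actually Bad

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)          # share of actual Good predicted Good
specificity = tn / (tn + fp)          # share of actual Bad predicted Bad
precision = tp / (tp + fp)            # Pos Pred Value
neg_pred_value = tn / (tn + fn)
balanced_accuracy = (sensitivity + specificity) / 2

for name, value in [
    ("Accuracy", accuracy),
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Pos Pred Value", precision),
    ("Neg Pred Value", neg_pred_value),
    ("Balanced Accuracy", balanced_accuracy),
]:
    print(f"{name}: {value:.3f}")
```

The low specificity is the number to watch here: it counts how many ‘Bad’ customers are caught, and every miss is one of the expensive false positives.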
Train Results (GBM):
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Bad
## Good 543 16
## Bad 18 225
##
## Accuracy : 0.958
## 95% CI : (0.941, 0.97)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.899
##
## Mcnemar's Test P-Value : 0.864
##
## Sensitivity : 0.968
## Specificity : 0.934
## Pos Pred Value : 0.971
## Neg Pred Value : 0.926
## Prevalence : 0.700
## Detection Rate : 0.677
## Detection Prevalence : 0.697
## Balanced Accuracy : 0.951
##
## 'Positive' Class : Good
##
Train Results (Random Forest):
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Bad
## Good 535 52
## Bad 26 189
##
## Accuracy : 0.903
## 95% CI : (0.88, 0.922)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : < 0.0000000000000002
##
## Kappa : 0.761
##
## Mcnemar's Test P-Value : 0.00464
##
## Sensitivity : 0.954
## Specificity : 0.784
## Pos Pred Value : 0.911
## Neg Pred Value : 0.879
## Prevalence : 0.700
## Detection Rate : 0.667
## Detection Prevalence : 0.732
## Balanced Accuracy : 0.869
##
## 'Positive' Class : Good
##
##                 models train_credit_cost train_precision test_credit_cost test_precision
## 1             baseline            0.0207          0.6995          0.01717         0.7020
## 2  Logistic Regression           -0.1012          0.8170         -0.07475         0.8013
## 3                  GBM           -0.2509          0.9714         -0.08586         0.8309
## 4        Random Forest           -0.2087          0.9114         -0.10505         0.8605
credit_cost and precision are in sync.
Train results are best for GBM. But it is overfitting, i.e. variance is high, so its results on test are not that great.
Test results are best for Random Forest. It has less variance than GBM, but its bias is higher.
It may seem that GBM is the better model, but we still haven’t seen the uncertainty (variance) in the results. The difference between train and test set results gives some idea about it, but it’s better to see it on cross-validated results.
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: gbm_grid_11_model_3
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 50 50 16282 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 12 27 21.26000
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.04534
## RMSE: 0.2129
## LogLoss: 0.1929
## Mean Per-Class Error: 0.04519
## AUC: 0.9927
## AUCPR: 0.9953
## Gini: 0.9853
## R^2: 0.7393
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Bad Good Error Rate
## Bad 225 16 0.066390 =16/241
## Good 20 814 0.023981 =20/834
## Totals 245 830 0.033488 =36/1075
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.565630 0.978365 233
## 2 max f2 0.379781 0.986266 274
## 3 max f0point5 0.624613 0.985565 212
## 4 max accuracy 0.585040 0.966512 227
## 5 max precision 0.989265 1.000000 0
## 6 max recall 0.321050 1.000000 287
## 7 max specificity 0.989265 1.000000 0
## 8 max absolute_mcc 0.585040 0.906286 227
## 9 max min_per_class_accuracy 0.603570 0.962656 221
## 10 max mean_per_class_accuracy 0.624613 0.966521 212
## 11 max tns 0.989265 241.000000 0
## 12 max fns 0.989265 832.000000 0
## 13 max fps 0.020773 241.000000 399
## 14 max tps 0.321050 834.000000 287
## 15 max tnr 0.989265 1.000000 0
## 16 max fnr 0.989265 0.997602 0
## 17 max fpr 0.020773 1.000000 399
## 18 max tpr 0.321050 1.000000 287
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.1689
## RMSE: 0.4109
## LogLoss: 0.5094
## Mean Per-Class Error: 0.4049
## AUC: 0.7906
## AUCPR: 0.8921
## Gini: 0.5812
## R^2: 0.1967
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Bad Good Error Rate
## Bad 54 187 0.775934 =187/241
## Good 19 542 0.033868 =19/561
## Totals 73 729 0.256858 =206/802
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.219487 0.840310 347
## 2 max f2 0.109460 0.922619 381
## 3 max f0point5 0.606021 0.834918 216
## 4 max accuracy 0.443982 0.754364 268
## 5 max precision 0.991849 1.000000 0
## 6 max recall 0.045964 1.000000 396
## 7 max specificity 0.991849 1.000000 0
## 8 max absolute_mcc 0.606021 0.439563 216
## 9 max min_per_class_accuracy 0.672135 0.725490 186
## 10 max mean_per_class_accuracy 0.606021 0.729440 216
## 11 max tns 0.991849 241.000000 0
## 12 max fns 0.991849 560.000000 0
## 13 max fps 0.024140 241.000000 399
## 14 max tps 0.045964 561.000000 396
## 15 max tnr 0.991849 1.000000 0
## 16 max fnr 0.991849 0.998217 0
## 17 max fpr 0.024140 1.000000 399
## 18 max tpr 0.045964 1.000000 396
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy 0.7699316 0.041614145 0.78443116 0.7939394 0.7051282 0.8113208
## auc 0.7899245 0.036268797 0.7916667 0.8346235 0.75 0.8153495
## aucpr 0.88372415 0.02987459 0.89833695 0.91614044 0.8630901 0.8981989
## err 0.23006836 0.041614145 0.21556886 0.2060606 0.2948718 0.18867925
## err_count 36.8 5.9329586 36.0 34.0 46.0 30.0
## cv_5_valid
## accuracy 0.7548387
## auc 0.75798285
## aucpr 0.8428545
## err 0.2451613
## err_count 38.0
##
## ---
## mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## pr_auc 0.88372415 0.02987459 0.89833695 0.91614044 0.8630901 0.8981989
## precision 0.77198565 0.04895599 0.7837838 0.7887324 0.69736844 0.83064514
## r2 0.19584712 0.08572516 0.20389102 0.28802457 0.07672223 0.2621514
## recall 0.9591504 0.029826047 0.96666664 0.9655172 1.0 0.91964287
## rmse 0.41076636 0.02622249 0.40124473 0.38554546 0.4484144 0.39196244
## specificity 0.33468577 0.17005084 0.31914893 0.3877551 0.08 0.5531915
## cv_5_valid
## pr_auc 0.8428545
## precision 0.7593985
## r2 0.1484464
## recall 0.94392526
## rmse 0.4266648
## specificity 0.33333334
## Model Details:
## ==============
##
## H2OBinomialModel: drf
## Model ID: drf_grid_11_model_4
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 300 300 178068 6
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 6 6.00000 28 55 42.49000
##
##
## H2OBinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
##
## MSE: 0.167
## RMSE: 0.4086
## LogLoss: 0.5039
## Mean Per-Class Error: 0.2986
## AUC: 0.7958
## AUCPR: 0.8948
## Gini: 0.5916
## R^2: 0.2057
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Bad Good Error Rate
## Bad 128 113 0.468880 =113/241
## Good 72 489 0.128342 =72/561
## Totals 200 602 0.230673 =185/802
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.588809 0.840929 274
## 2 max f2 0.263705 0.921788 396
## 3 max f0point5 0.670701 0.834331 217
## 4 max accuracy 0.588809 0.769327 274
## 5 max precision 0.969245 1.000000 0
## 6 max recall 0.263705 1.000000 396
## 7 max specificity 0.969245 1.000000 0
## 8 max absolute_mcc 0.620096 0.435152 251
## 9 max min_per_class_accuracy 0.676920 0.729055 213
## 10 max mean_per_class_accuracy 0.700984 0.734055 192
## 11 max tns 0.969245 241.000000 0
## 12 max fns 0.969245 560.000000 0
## 13 max fps 0.172355 241.000000 399
## 14 max tps 0.263705 561.000000 396
## 15 max tnr 0.969245 1.000000 0
## 16 max fnr 0.969245 0.998217 0
## 17 max fpr 0.172355 1.000000 399
## 18 max tpr 0.263705 1.000000 396
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: drf
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.1673
## RMSE: 0.409
## LogLoss: 0.5034
## Mean Per-Class Error: 0.3413
## AUC: 0.7942
## AUCPR: 0.8968
## Gini: 0.5885
## R^2: 0.2041
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Bad Good Error Rate
## Bad 101 140 0.580913 =140/241
## Good 57 504 0.101604 =57/561
## Totals 158 644 0.245636 =197/802
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.563329 0.836515 298
## 2 max f2 0.367694 0.922747 382
## 3 max f0point5 0.656225 0.834299 231
## 4 max accuracy 0.569465 0.754364 293
## 5 max precision 0.966395 1.000000 0
## 6 max recall 0.309813 1.000000 393
## 7 max specificity 0.966395 1.000000 0
## 8 max absolute_mcc 0.654595 0.436647 233
## 9 max min_per_class_accuracy 0.677673 0.718360 213
## 10 max mean_per_class_accuracy 0.656225 0.729425 231
## 11 max tns 0.966395 241.000000 0
## 12 max fns 0.966395 560.000000 0
## 13 max fps 0.224467 241.000000 399
## 14 max tps 0.309813 561.000000 393
## 15 max tnr 0.966395 1.000000 0
## 16 max fnr 0.966395 0.998217 0
## 17 max fpr 0.224467 1.000000 399
## 18 max tpr 0.309813 1.000000 393
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy 0.7579888 0.029784564 0.7305389 0.7878788 0.74358976 0.7924528
## auc 0.7938621 0.023601508 0.79468083 0.8224842 0.76584905 0.8107903
## aucpr 0.8911885 0.020987421 0.9034187 0.9098033 0.86775553 0.9060463
## err 0.24201117 0.029784564 0.26946107 0.21212122 0.25641027 0.20754717
## err_count 38.8 4.816638 45.0 35.0 40.0 33.0
## cv_5_valid
## accuracy 0.7354839
## auc 0.77550626
## aucpr 0.8689187
## err 0.26451612
## err_count 41.0
##
## ---
## mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## pr_auc 0.8911885 0.020987421 0.9034187 0.9098033 0.86775553 0.9060463
## precision 0.7727225 0.047828298 0.72727275 0.8292683 0.7619048 0.816
## r2 0.20340157 0.024107175 0.19744903 0.22976469 0.17218669 0.22555843
## recall 0.9353987 0.05225089 1.0 0.87931037 0.9056604 0.91071427
## rmse 0.40912727 0.010530577 0.40286487 0.40100962 0.42459956 0.40156436
## specificity 0.342424 0.22247364 0.04255319 0.5714286 0.4 0.5106383
## cv_5_valid
## pr_auc 0.8689187
## precision 0.7291667
## r2 0.19204898
## recall 0.9813084
## rmse 0.4155979
## specificity 0.1875
Not much difference here either: DRF seems only slightly better, but that may change with fold assignment. For GBM, I tuned the positive-class upsampling but didn’t tune the other hyperparameters, and for DRF I did the exact opposite. So both models have a lot of scope for tuning, and I am not at a stage to pick the right model.
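To put a number on "not much difference", the per-fold AUC values from the two cross-validation summaries above can be compared directly. The Python sketch below recomputes the mean and the sample standard deviation for each model and sets the gap in means against the fold-to-fold spread. It is an informal check, not a significance test.

```python
from statistics import mean, stdev

# Per-fold AUC values copied from the cross-validation summaries above.
gbm_auc = [0.7916667, 0.8346235, 0.75, 0.8153495, 0.75798285]
drf_auc = [0.79468083, 0.8224842, 0.76584905, 0.8107903, 0.77550626]

gbm_mean, gbm_sd = mean(gbm_auc), stdev(gbm_auc)
drf_mean, drf_sd = mean(drf_auc), stdev(drf_auc)
gap = drf_mean - gbm_mean

print(f"GBM AUC: {gbm_mean:.4f} +/- {gbm_sd:.4f}")
print(f"DRF AUC: {drf_mean:.4f} +/- {drf_sd:.4f}")
print(f"Gap in means: {gap:.4f}")
```

The gap in mean AUC is several times smaller than either model's standard deviation across folds, which is consistent with the caution about fold assignment.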
We can look at the feature importance of either GBM or DRF, but DRF gives a better plot without breaking categorical features into their classes, so we will use DRF.
The top-3 features are “checking_account_status”, “duration_in_months”, and “credit_amount”.
To profile a ‘Good’ credit worthy person as per the model, let’s explore the relationship of top predictors with the predicted class for the DRF model.
So, the most credit worthy person would have the following profile:
- checking_account_status is “A14” i.e. no checking account
- duration_in_months is less than 12 months i.e. a year
- credit_amount is less than 2k
- credit_history is “A34” i.e. critical account/other existing credits
- purpose is “A43” i.e. radio/television
This seems slightly unintuitive, but I will have to go into model explainability to get better insights, and currently time is too short for that.